Comment: Quantifying the Fraction of
Missing Information for Hypothesis
Testing in Statistical and Genetic Studies

INTRODUCTION

The authors suggest an interesting way to measure
the fraction of missing information in the context of
hypothesis testing. The measure seeks to quantify
the impact of missing observations on the test
between two hypotheses. Knowing the size of this impact
is useful in applied research. An example arises
in genetics, where multiple tests of the same sort
are performed on different variables with different
missing rates, and follow-up studies may be designed
to resolve missing values in selected variables.

In this discussion, we offer our prospective views
on the use of relative information in a follow-up
study. For studies where the impact of missing
observations varies greatly across variables,
and where the investigators have the flexibility to
allocate different amounts of effort to different
variables, an optimal design may be derived using
relative information measures to improve the
cost-effectiveness of the follow-up.

Using the simple motivating example in their paper,
we examine the estimation of relative information
by $\mathcal{RI}_1$ and $\mathcal{RI}_0$ in terms of unbiasedness and
variability, and discuss issues that require further
research. Although the relative information measure
developed in their paper estimates the mean impact
of the missing data, the actual impact may be highly
variable when the amount of information in the
observed data is moderate or small, which makes the
estimated mean relative information a less reliable
prediction of the actual impact of missing
observations. For this reason, we suggest a simple way to
estimate the variability of relative information
between complete data and observed data in the simple
motivating example. Further investigation is required
to incorporate these variability estimates
into the optimal design of follow-up studies.

RELATIVE INFORMATION AND FOLLOW-UP 
STUDY DESIGNS 

Missing values can occur for many reasons and can
have different effects on a given test. Nicolae, Meng
and Kong (NMK) pointed out that the impact of missing
values (in terms of relative information) on a test
may not be as simple as the "face value" of $n_0/n$,
where $n_0$ is the number of observed values and $n$ is
the number of individuals ($n - n_0$ is then the number
of missing values). Therefore, a more accurate
estimate of the information gain due to the
resolution of missing values is important for the design
of follow-up studies.

Given an existing data set with $n$ individuals (with
missing values), if $n_1$ additional independent samples
are collected (possibly with the same missing
rate) to expand this data set, it is intuitive to assume
that the ratio of information in the original data
to that in the expanded data is approximately $n/(n + n_1)$.
Now consider a test on the existing data with $n$
individuals that has some missing values (say, $n_0$
observed values). Suppose the relative information is estimated
to be 80%, meaning that if the data used for this
test are "resolved" to become complete, the expected
log likelihood ratio is about 1/80% = 125% of the
observed log likelihood ratio. To achieve the same
level of information by adding new independent
observations, one would need to collect a sample of
$n_1 = n \times 25\%$ additional individuals, since
$n/(n + n_1) = 80\%$ gives $n_1 = n/4$. In many situations,
resolving missing values, if possible, turns out
to be much cheaper than collecting data on additional
samples. In Section 2 of the NMK paper, an
example was given on genotyping ambiguity in genetic
linkage analysis (meaning that the exact inheritance
vectors needed for the lod score computation
cannot always be derived given the genotypes
observed on the individuals). Here, let $Y_{ob}$ be the current
data with unambiguous genotypes. For a follow-up
study, a researcher can decide between (1) increasing
the density of genetic markers on the observed
individuals to resolve the ambiguities and (2) increasing
the sample size by genotyping more independent
individuals on the same set of markers used for
the previously observed individuals. If we denote the
two potential expanded data sets as $Y_{co,m}$ and $Y_{co,i}$,
with $m$ and $i$ standing for markers and individuals,
we can compute the fraction of information between
$Y_{ob}$ and $Y_{co,m}$, and between $Y_{ob}$ and $Y_{co,i}$,
potentially using $\mathcal{RI}_1$ and $\mathcal{RI}_0$ proposed in the NMK
paper. Comparing these two measures of relative
information, the researcher can then decide which option
(increasing markers or increasing individuals)
is more cost-efficient for the inferential task at hand.
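
The 80% illustration above reduces to simple arithmetic, sketched here in Python. The function names and the two unit costs in the usage lines are hypothetical and chosen only for illustration; the sketch assumes, as in the text, that the information ratio between the original and the expanded data is approximately n/(n + n_1).

```python
def expected_complete_ratio(rel_info):
    """Expected complete-data log LR as a multiple of the observed log LR."""
    if not 0.0 < rel_info <= 1.0:
        raise ValueError("relative information must lie in (0, 1]")
    return 1.0 / rel_info


def equivalent_new_individuals(n, rel_info):
    """Solve n / (n + n_1) = rel_info for n_1, the number of new individuals
    that would carry the same information as resolving the missing values."""
    return n * (expected_complete_ratio(rel_info) - 1.0)


def cheaper_option(n, n_obs, rel_info, cost_resolve, cost_new):
    """Compare hypothetical total costs of the two follow-up options."""
    resolve_total = (n - n_obs) * cost_resolve
    new_total = equivalent_new_individuals(n, rel_info) * cost_new
    return "resolve" if resolve_total <= new_total else "new individuals"


if __name__ == "__main__":
    print(expected_complete_ratio(0.8))           # about 1.25
    print(equivalent_new_individuals(1000, 0.8))  # about 250
    # Hypothetical costs: 1 unit per resolved value, 3 units per new individual.
    print(cheaper_option(1000, 800, 0.8, cost_resolve=1.0, cost_new=3.0))
```

With these made-up costs, resolving the 200 missing values is the cheaper route; a different cost ratio could reverse the conclusion.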

In practice, one would need to consider such
comparisons for multiple variables simultaneously. Here
we consider a simple example. Let $\{Y_1, \ldots, Y_M\}$ be
the variables studied. For $Y_i$, $n_{0,i}$ values are observed
on $n$ individuals. In a follow-up study, $n_{1,i}$ missing
values can be resolved at $Y_i$. At $Y_i$, the relative
information (say, $\mathcal{RI}_i$) is a function of $n_{1,i}$, the observed
lod score $\mathrm{lod}_{ob,i}$ and the observed m.l.e. To evaluate
the overall information gain due to these additional
observations, we suggest an expression similar
to that of (19) in the NMK paper (see footnote 1):

$$\mathcal{RI}_1(n_{1,1}, \ldots, n_{1,M}) = \frac{\sum_{i=1}^{M} \mathrm{lod}_{ob,i}\, \mathcal{RI}_i(n_{1,i})^{-1}}{\sum_{i=1}^{M} \mathrm{lod}_{ob,i}}. \tag{1}$$

A possible way to obtain an optimal design would be
to select values of $0 \le n_{1,i} \le n - n_{0,i}$ to maximize
the information gain while controlling for a fixed
cost. Differences in design may involve varying setup
costs that may depend on, for example, the number
of nonzero $n_{1,i}$, as in genotyping studies.
Once such a cost function is fully specified, linear
programming can be used to obtain the optimal
design. If the $n_{1,i}$'s in the optimal design
take similar values for $i = 1, \ldots, M$, this may suggest
a design that collects data on $n_1$ new independent
individuals and takes measurements on the same $M$
variables as in the original data.
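
A minimal sketch of such a design search follows, under strong simplifying assumptions that are ours and not taken from the NMK paper: the overall gain in (1) is read as the lod-weighted average of the inverse relative informations, each $\mathcal{RI}_i$ is taken to be the "face value" n_{0,i}/(n_{0,i} + n_{1,i}), every resolved value at variable i has a constant hypothetical cost, and setup costs are ignored. Under these assumptions the objective is linear in the n_{1,i}, so the linear program can be solved greedily by filling variables in decreasing order of gain per unit cost. All names and numbers are illustrative.

```python
def overall_gain(lods, n_obs, n_resolved):
    """Lod-weighted average of RI_i(n_1i)^{-1}.

    ASSUMPTION: RI_i(n_1i) = n_obs_i / (n_obs_i + n_1i), the 'face value'.
    """
    weighted = sum(l * (o + r) / o for l, o, r in zip(lods, n_obs, n_resolved))
    return weighted / sum(lods)


def greedy_design(lods, n_obs, n, unit_cost, budget):
    """Maximise overall_gain subject to sum(unit_cost_i * n_1i) <= budget
    and 0 <= n_1i <= n - n_obs_i (continuous relaxation, positive lods).

    The marginal gain of one resolved value at variable i is proportional
    to lod_i / n_obs_i, so variables are filled by gain per unit cost.
    """
    order = sorted(
        range(len(lods)),
        key=lambda i: lods[i] / (n_obs[i] * unit_cost[i]),
        reverse=True,
    )
    design = [0.0] * len(lods)
    remaining = float(budget)
    for i in order:
        if remaining <= 0:
            break
        take = float(min(n - n_obs[i], remaining / unit_cost[i]))
        design[i] = take
        remaining -= take * unit_cost[i]
    return design


if __name__ == "__main__":
    lods = [3.0, 1.0, 2.0]      # hypothetical observed lod scores
    n_obs = [800, 500, 900]     # hypothetical observed counts, n = 1000
    design = greedy_design(lods, n_obs, 1000, [1.0, 1.0, 1.0], budget=300)
    print(design, overall_gain(lods, n_obs, design))
```

Setup costs that depend on the number of nonzero n_{1,i} make the problem a mixed-integer program, for which this greedy rule is no longer guaranteed to be optimal.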



1 Equation (19) in the original paper is to combine relative 
information measures from several studies, while (1) here is 
to evaluate relative overall information of multiple variables. 



Another advantage of the likelihood ratio-based
evaluation of information used by Nicolae, Meng and
Kong is that one can evaluate the potential information
gain conditioning not only on the observed data
at the variable of interest but also on some
associated variables, through a model-based calculation.
Similar model-based strategies have been commonly
used for imputing missing genotypes in genetic
studies. Such considerations may introduce more
complicated design questions than the computation
in (1) but may also bring better efficiency.

THE "EMPIRICAL" FRACTION OF 
INFORMATION AND ITS VARIABILITY 

Using the simple motivating example in Section
1 of the NMK paper, we consider the relation between
the empirical observed data log likelihood ratio
(lod score) and the "random" complete data log
likelihood ratio (lod score). We relate the proposed
fraction of information to the distribution of the
"empirical" ratio. The "empirical" ratio is the actual
random gain due to additional observations, while the
estimated relative information and any optimal design
derived from it are intended to approximate this
random outcome.

In Figure 1, we plot the joint distribution of the
lod scores under the observed data and the complete
data, with 80% of the values observed. The
distribution is evaluated under three true values of
the probability of success with $n_0 = 800$ and
$n = 1000$. To obtain a realistic evaluation, we use the
traditional definition of the likelihood ratio test (or
the lod score), where the ratio is evaluated between
the maximum likelihood estimate given the current data
(observed or complete) and the value in the null
hypothesis.

We first notice the positive correlation between
the complete data statistic and the observed data
statistic. Gray broken lines in Figure 1 give reference
lines for the empirical or "random" ratio between
the complete data lod score (or log LR statistic)
and the observed lod score. The estimated $\mathcal{RI}_1$ (which
coincides with $r = n_0/n$) corresponds to a line going
(almost exactly) through the center of the joint
distribution, indicating that it is a good estimate of the
expected ratio (or fraction of information) regardless
of the value of the observed lod score.

Fig. 1. Distribution of log likelihood ratio test statistics (or lod scores) given observed data and complete data.
Panels: p = 0.525, p = 0.55 and p = 0.65; horizontal axis: observed data log LR. The contour
plots display the joint distribution of the log likelihood ratio test statistics given the observed data and the complete data. Given
$n_0 = 800$ and $n = 1000$, the ratio between the complete data log LR and the observed data log LR is expected to be $n/n_0 = 1.25$.
In each contour plot, a dotted line indicates the line $y = 1.25x$. The gray broken lines display $y = rx$ with $r$ varying
and provide a reference for the empirical ratio of the complete data log LR to the observed data log LR.

For a small departure (say, $p = 0.55$) from the null
hypothesis ($p_0 = 0.5$), the LR test does not have
great power and the test statistics are distributed close to
zero. The contours of the distribution intersect
lines whose ratio values go as high as
13. This is natural, given that the observed data
statistic can become very small by chance and
create a highly variable ratio. For values that are far
away from the null hypothesis, the estimated $\mathcal{RI}_1$
becomes more precise.
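
The behaviour described above can be explored with a small simulation. The following is a sketch only, not the code used for Figure 1: it draws observed and missing binomial counts with n_0 = 800 and n = 1000, evaluates both log LR statistics at their own m.l.e. against p_0 = 0.5, and summarizes the empirical ratios. The function names, the number of replications and the seed are our own choices.

```python
import math
import random


def log_lr(x, n, p0):
    """Log LR of the m.l.e. x/n against p0 for x successes in n trials."""
    def term(count, prob_hat, prob_null):
        # Convention 0 * log(0) = 0 for boundary counts.
        return count * math.log(prob_hat / prob_null) if count > 0 else 0.0

    p_hat = x / n
    return term(x, p_hat, p0) + term(n - x, 1.0 - p_hat, 1.0 - p0)


def simulate(p, p0=0.5, n_obs=800, n=1000, reps=1000, seed=1):
    """Return (observed log LR, complete log LR) pairs for simulated data."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(reps):
        x_obs = sum(rng.random() < p for _ in range(n_obs))
        x_mis = sum(rng.random() < p for _ in range(n - n_obs))
        pairs.append((log_lr(x_obs, n_obs, p0), log_lr(x_obs + x_mis, n, p0)))
    return pairs


def summarize(pairs):
    """Ratio of mean log LRs and the 5%-95% range of the empirical ratios."""
    ratio_of_means = sum(c for _, c in pairs) / sum(o for o, _ in pairs)
    ratios = sorted(c / o for o, c in pairs if o > 0)
    low = ratios[int(0.05 * len(ratios))]
    high = ratios[int(0.95 * len(ratios))]
    return ratio_of_means, high - low


if __name__ == "__main__":
    for p in (0.525, 0.55, 0.65):
        ratio_of_means, spread = summarize(simulate(p))
        print(f"p={p}: ratio of means={ratio_of_means:.3f}, "
              f"5%-95% range of ratios={spread:.3f}")
```

The range of the empirical ratios is reported through quantiles rather than a maximum, because replicates with an observed log LR near zero can produce arbitrarily large ratios.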

As illustrated above and in Figure 1, the unobserved
random missing values make the relative "empirical"
information a random quantity. It is instructive
to evaluate the amount of variation in the complete
data lod score. For the simple binomial example,
it is easy to obtain

$$\mathrm{var}[\mathrm{lod}(p_1, p_2; Y_{co}) \mid Y_{ob}, p] = (n - n_0)\, p(1 - p) \left[\log\frac{p_1}{p_2} - \log\frac{1 - p_1}{1 - p_2}\right]^2. \tag{2}$$

Consider a null hypothesis that specifies the probability
of success as $p_0$ and let $p$ be the true parameter
value. Let $\mathcal{RI}_Y(Y_{co}, Y_{ob}; p, p_0)$ be the empirical
fraction of information regarding the difference between
$p$ and $p_0$, for a set of $Y_{co}$ with only $Y_{ob}$ observed
(that is, the ratio of the lod scores between $p$ and
$p_0$ derived using the observed data and the potential
complete data). It is easy to see that $\mathcal{RI}_Y^{-1}$ is a more
natural relative information ratio to use for evaluating
the overall relative information in (1) and identifying
an optimal follow-up design. From a computation similar
to (2), $\mathcal{RI}_Y^{-1}$, conditioning on $Y_{ob}$, has expectation

$$E\,\mathcal{RI}_Y^{-1} = 1 + (n - n_0)\,(\mathrm{lod}(p, p_0; Y_{ob}))^{-1} \left[p \log\frac{p}{p_0} + (1 - p) \log\frac{1 - p}{1 - p_0}\right]$$

and variance

$$\mathrm{var}\,\mathcal{RI}_Y^{-1} = (n - n_0)\, p(1 - p)\, (\mathrm{lod}(p, p_0; Y_{ob})^2)^{-1} \left[\log\frac{p}{p_0} - \log\frac{1 - p}{1 - p_0}\right]^2.$$
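
As a check on these two moments, the sketch below compares the closed-form expressions with an exact enumeration over the binomial distribution of the missing successes. It assumes the binomial setup of the motivating example, with the lod score between fixed values $p$ and $p_0$; the function names and the numerical inputs are hypothetical.

```python
import math


def lod(x, n, p, p0):
    """Log likelihood ratio between p and p0 for x successes in n trials."""
    return x * math.log(p / p0) + (n - x) * math.log((1 - p) / (1 - p0))


def moments_formula(x_obs, n_obs, n, p, p0):
    """Closed-form mean and variance of lod(Y_co) / lod(Y_ob) given Y_ob."""
    lod_ob = lod(x_obs, n_obs, p, p0)
    n_mis = n - n_obs
    per_obs = p * math.log(p / p0) + (1 - p) * math.log((1 - p) / (1 - p0))
    slope = math.log(p / p0) - math.log((1 - p) / (1 - p0))
    mean = 1 + n_mis * per_obs / lod_ob
    var = n_mis * p * (1 - p) * slope ** 2 / lod_ob ** 2
    return mean, var


def moments_exact(x_obs, n_obs, n, p, p0):
    """Same moments by summing over the Binomial(n - n_obs, p) pmf."""
    lod_ob = lod(x_obs, n_obs, p, p0)
    n_mis = n - n_obs
    mean = second = 0.0
    for k in range(n_mis + 1):
        prob = math.comb(n_mis, k) * p ** k * (1 - p) ** (n_mis - k)
        ratio = lod(x_obs + k, n, p, p0) / lod_ob
        mean += prob * ratio
        second += prob * ratio ** 2
    return mean, second - mean ** 2


if __name__ == "__main__":
    args = dict(x_obs=52, n_obs=80, n=100, p=0.65, p0=0.5)
    print(moments_formula(**args))
    print(moments_exact(**args))
```

The two computations agree because the complete data lod score, for fixed $p$ and $p_0$, is linear in the number of missing successes.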



In practice, we may substitute $p$ with $\hat{p}$ and have
$E\,\mathcal{RI}_Y^{-1}$ estimated by $\mathcal{RI}_1^{-1}$. Figure 2 gives the
estimated standard deviation of $\mathcal{RI}_Y^{-1}$ with probability
density curves under different true values of $p$.
When the true value is close to the null hypothesis
$p_0$, $\mathcal{RI}_Y^{-1}$ is highly variable, which makes
the simple estimate $\mathcal{RI}_1^{-1}$, as an estimated expectation
of $\mathcal{RI}_Y^{-1}$, an unreliable prediction of $\mathcal{RI}_Y^{-1}$. A
procedure incorporating both $\widehat{E}\,\mathcal{RI}_Y^{-1} = \mathcal{RI}_1^{-1}$ and
an estimated standard error of $\mathcal{RI}_Y^{-1}$ should be
considered to address design issues similar to that
of (1).
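
A plug-in version of the standard deviation can be sketched as follows. This is an illustration rather than the computation behind Figure 2: the true $p$ is replaced by the observed proportion, and the observed fraction of 80% used in the usage lines is an assumption of ours. The function name and the chosen counts are hypothetical.

```python
import math


def estimated_sd(x_obs, n_obs, n, p0=0.5):
    """Plug-in standard deviation of the inverse empirical fraction of
    information, with the true p replaced by p_hat = x_obs / n_obs."""
    p_hat = x_obs / n_obs
    if p_hat <= 0.0 or p_hat >= 1.0:
        return 0.0  # plug-in variance of the missing successes is zero
    log_odds_ratio = math.log(p_hat / p0) - math.log((1 - p_hat) / (1 - p0))
    lod_ob = (x_obs * math.log(p_hat / p0)
              + (n_obs - x_obs) * math.log((1 - p_hat) / (1 - p0)))
    if lod_ob == 0.0:
        return math.inf  # observed lod score vanishes when p_hat equals p0
    sd_lod_co = math.sqrt((n - n_obs) * p_hat * (1 - p_hat)) * abs(log_odds_ratio)
    return sd_lod_co / abs(lod_ob)


if __name__ == "__main__":
    for n in (100, 1000):
        n_obs = n * 4 // 5  # ASSUMPTION: 80% of the values are observed
        for frac in (0.525, 0.55, 0.65):
            x_obs = round(frac * n_obs)
            print(n, x_obs, estimated_sd(x_obs, n_obs, n))
```

The estimate grows without bound as the observed proportion approaches $p_0$, which is the instability discussed above.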

IN SUMMARY 

The paper by Nicolae, Meng and Kong provides
interesting strategies for evaluating the relative
information, contained in observed data, for discerning
two hypotheses. Such measures support the quantification
of the possible information gain that can be brought
by additional observations, which can be used to
optimally design follow-up efforts. The measures $\mathcal{RI}_1$
and $\mathcal{RI}_0$ deserve more research for further
understanding. More importantly, theory and practice
should be combined to provide design suggestions
that utilize relative information measures such as $\mathcal{RI}_1$
and corresponding variability measures.

Fig. 2. Estimated standard deviation of $\mathcal{RI}_Y^{-1}$. For sample sizes $n = 100, 1000$, we plot the estimated standard deviation of
$\mathcal{RI}_Y^{-1}$ against the observed number of successes $x_0$. Density curves of the observed number of successes $x_0$ under different true $p$
values are also plotted.